research problem
The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas After Iterative Paraphrasing?
Shahriar, Sadat, Ayoobi, Navid, Mukherjee, Arjun
With the increasing reliance on LLMs as research agents, distinguishing between LLM and human-generated ideas has become crucial for understanding the cognitive nuances of LLMs' research capabilities. While detecting LLM-generated text has been extensively studied, distinguishing human vs LLM-generated scientific idea remains an unexplored area. In this work, we systematically evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas, particularly after successive paraphrasing stages. Our findings highlight the challenges SOTA models face in source attribution, with detection performance declining by an average of 25.4\% after five consecutive paraphrasing stages. Additionally, we demonstrate that incorporating the research problem as contextual information improves detection performance by up to 2.97%. Notably, our analysis reveals that detection algorithms struggle significantly when ideas are paraphrased into a simplified, non-expert style, contributing the most to the erosion of distinguishable LLM signatures.
- North America > United States > Texas > Harris County > Houston (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Africa > Cameroon > Gulf of Guinea (0.04)
Looking Forward: Challenges and Opportunities in Agentic AI Reliability
Xing, Liudong, Janet, null, Lin, null
The AI conversation can be traced as far back as Alan Turing's milestone paper published in 1950, which considered the fundamental question "Can machines think?" [1]. In 1956, AI got its name and mission as a scientific field at the first AI conference held at Dartmouth College [2]. Following AI's foundational period in the 1950s ~ 1970s, AI has evolved from early rule-based systems (1970s ~ 1990s), through classical machine learning and deep learning with neural networks (1990s ~ 2020s), to today's generative and agentic AI systems (since 2010s). Correspondingly, as a vital requirement of these systems, the reliability concept and concerns are also evolving, particularly in the interpretation of "required function" (see Table 1 in Chapter 10), based on the definition in standards like ISO 8402 "The ability of an item to perform a required function, under given environmental and operational conditions and for a stated period of time ". While a conventional AI system is concerned with providing stable and accurate classifications, predictions, or optimizations, a reliable generative AI system focuses on producing outputs that are trustworthy, consistent, safe, and contextually appropriate [3]. Building on both, a reliable agentic AI system should additionally conduct functions of reasoning, goal alignment, planning, safe adaption and interaction in dynamic and collaborative multi-agent contexts. The expansion of reliability concepts has introduced new challenges and research opportunities, as exemplified in Figure 1. In the following sections, we shed lights on these challenges and opportunities in building reliable AI systems, particularly, agentic AI systems.
- Europe > Sweden > Norrbotten County > Luleå (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Massachusetts > Bristol County > Dartmouth (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration
Zhang, Zhi, Liu, Yan, Hu, Zhejing, Chen, Gong, Zhong, Sheng-hua, Cao, Jiannong
Automating the end-to-end scientific research process poses a fundamental challenge: it requires both evolving high-level plans that are novel and sound, and executing these plans correctly amidst dynamic and uncertain conditions. To address this bilevel challenge, we propose a novel Double-Loop Multi-Agent (DLMA) framework to solve the given research problem automatically. The leader loop, composed of professor agents, is responsible for evolving research plans. It employs an evolutionary algorithm through involvement, improvement, and integration meetings to iteratively generate and refine a pool of research proposals, exploring the solution space effectively. The follower loop, composed of doctoral student agents, is responsible for executing the best-evolved plan. It dynamically adjusts the plan during implementation via pre-hoc and post-hoc meetings, ensuring each step (e.g., drafting, coding) is well-supported by contextual and external observations. Extensive experiments on benchmarks like ACLAward and Laboratory show that DLMA generates research papers that achieve state-of-the-art scores in automated evaluation, significantly outperforming strong baselines. Ablation studies confirm the critical roles of both loops, with evolution driving novelty and execution ensuring soundness.
- Asia > China > Guangdong Province > Shenzhen (0.05)
- Asia > China > Hong Kong (0.04)
- North America > United States > Illinois (0.04)
- Overview (0.93)
- Research Report > Promising Solution (0.46)
HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance
Vasu, Rosni, Basu, Chandrayee, Mishra, Bhavana Dalvi, Sarasua, Cristina, Clark, Peter, Bernstein, Abraham
Large Language models have demonstrated promising performance in research ideation across scientific domains. Hypothesis development, the process of generating a highly specific declarative statement connecting a research idea with empirical validation, has received relatively less attention. Existing approaches trivially deploy retrieval augmentation and focus only on the quality of the final output ignoring the underlying reasoning process behind ideation. We present $\texttt{HypER}$ ($\textbf{Hyp}$othesis Generation with $\textbf{E}$xplanation and $\textbf{R}$easoning), a small language model (SLM) trained for literature-guided reasoning and evidence-based hypothesis generation. $\texttt{HypER}$ is trained in a multi-task setting to discriminate between valid and invalid scientific reasoning chains in presence of controlled distractions. We find that $\texttt{HypER}$ outperformes the base model, distinguishing valid from invalid reasoning chains (+22\% average absolute F1), generates better evidence-grounded hypotheses (0.327 vs. 0.305 base model) with high feasibility and impact as judged by human experts ($>$3.5 on 5-point Likert scale).
- Oceania > Samoa (0.04)
- Oceania > American Samoa (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
A Vision for Auto Research with LLM Agents
Liu, Chengwei, Wang, Chong, Cao, Jiayue, Ge, Jingquan, Wang, Kun, Zhang, Lyuye, Cheng, Ming-Ming, Zhao, Penghai, Li, Tianlin, Jia, Xiaojun, Li, Xiang, Li, Xingshuai, Liu, Yang, Feng, Yebo, Huang, Yihao, Xu, Yijia, Sun, Yuqiang, Zhou, Zhenhong, Xu, Zhengzi
This paper introduces Agent-Based Auto Research, a structured multi-agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self-improving, AI-driven research processes.
- Workflow (1.00)
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- Research Report > New Finding (0.68)
MIR: Methodology Inspiration Retrieval for Scientific Research Problems
Garikaparthi, Aniketh, Patwardhan, Manasi, Kanade, Aditya Sanjiv, Hassan, Aman, Vig, Lovekesh, Cohan, Arman
There has been a surge of interest in harnessing the reasoning capabilities of Large Language Models (LLMs) to accelerate scientific discovery. While existing approaches rely on grounding the discovery process within the relevant literature, effectiveness varies significantly with the quality and nature of the retrieved literature. We address the challenge of retrieving prior work whose concepts can inspire solutions for a given research problem, a task we define as Methodology Inspiration Retrieval (MIR). We construct a novel dataset tailored for training and evaluating retrievers on MIR, and establish baselines. To address MIR, we build the Methodology Adjacency Graph (MAG); capturing methodological lineage through citation relationships. We leverage MAG to embed an "intuitive prior" into dense retrievers for identifying patterns of methodological inspiration beyond superficial semantic similarity. This achieves significant gains of +5.4 in Recall@3 and +7.8 in Mean Average Precision (mAP) over strong baselines. Further, we adapt LLM-based re-ranking strategies to MIR, yielding additional improvements of +4.5 in Recall@3 and +4.8 in mAP. Through extensive ablation studies and qualitative analyses, we exhibit the promise of MIR in enhancing automated scientific discovery and outline avenues for advancing inspiration-driven retrieval.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (2 more...)
GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation
Feng, Tao, Sun, Yihang, You, Jiaxuan
The powerful capabilities of Large Language Models (LLMs) have led to their growing use in evaluating human-generated content, particularly in evaluating research ideas within academic settings. Existing solutions primarily rely on prompt-based LLM methods or fine-tuned lightweight language models for idea evaluation. However, these methods are often unstable and struggle to comprehend the complex semantic information embedded in the ideas, impeding their ability to perform high-quality evaluations. To address the above challenges, we propose GraphEval, a lightweight graph-based LLM framework for idea evaluation. Our insight is that a complex idea can be broken down into comprehensible viewpoint nodes using prompts from small LLMs. These viewpoint nodes can then be linked together through edges created from LLM-based relation extraction and/or BERT similarity scores. The created viewpoint-graph can be used to conveniently propagate scores across view-nodes to improve the robustness of the idea evaluations. In particular, we propose two lightweight graph-based methods for idea evaluation: (1) GraphEval-LP: a training-free label propagation algorithm that propagates evaluation scores from known view-nodes to unknown nodes; (2) GraphEval-GNN: a Graph Neural Networks (GNN) that is trained to predict the evaluation scores given the observed graph with minimal computation resources. Moreover, to overcome LLM's limitation in objectively assessing the novelty of ideas, we further propose a novelty detection model to GraphEval-GNN to enhance its capability in judging idea novelty. Experiments on two datasets show GraphEval improves F1 scores by at least 14% with low computation and API costs. Additionally, GraphEval can effectively detect plagiarized ideas.
- North America > United States > Illinois (0.04)
- Europe > Austria > Tyrol > Innsbruck (0.04)
- Africa > South Sudan > Equatoria > Central Equatoria > Juba (0.04)
Theoretical Physics Benchmark (TPBench) -- a Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
Chung, Daniel J. H., Gao, Zhiqi, Kvasiuk, Yurii, Li, Tianyi, Münchmeyer, Moritz, Rudolph, Maja, Sala, Frederic, Tadepalli, Sai Chaitanya
We introduce a benchmark to evaluate the capability of AI to solve problems in theoretical physics, focusing on high-energy theory and cosmology. The first iteration of our benchmark consists of 57 problems of varying difficulty, from undergraduate to research level. These problems are novel in the sense that they do not come from public problem collections. We evaluate our data set on various open and closed language models, including o3-mini, o1, DeepSeek-R1, GPT-4o and versions of Llama and Qwen. While we find impressive progress in model performance with the most recent models, our research-level difficulty problems are mostly unsolved. We address challenges of auto-verifiability and grading, and discuss common failure modes. While currently state-of-the art models are still of limited use for researchers, our results show that AI assisted theoretical physics research may become possible in the near future. We discuss the main obstacles towards this goal and possible strategies to overcome them. The public problems and solutions, results for various models, and updates to the data set and score distribution, are available on the website of the dataset tpbench.org.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- North America > United States > Indiana (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Promising Solution (0.87)
- Education > Educational Setting (0.67)
- Government > Regional Government (0.45)
Evaluating improvements on using Large Language Models (LLMs) for property extraction in the Open Research Knowledge Graph (ORKG)
Current research highlights the great potential of Large Language Models (LLMs) for constructing Scholarly Knowledge Graphs (SKGs). One particularly complex step in this process is relation extraction, aimed at identifying suitable properties to describe the content of research. This study builds directly on previous research of three Open Research Knowledge Graph (ORKG) team members who assessed the readiness of LLMs such as GPT-3.5, Llama 2, and Mistral for property extraction in scientific literature. Given the moderate performance observed, the previous work concluded that fine-tuning is needed to improve these models' alignment with scientific tasks and their emulation of human expertise. Expanding on this prior experiment, this study evaluates the impact of advanced prompt engineering techniques and demonstrates that these techniques can highly significantly enhance the results. Additionally, this study extends the property extraction process to include property matching to existing ORKG properties, which are retrieved via the API. The evaluation reveals that results generated through advanced prompt engineering achieve a higher proportion of matches with ORKG properties, further emphasizing the enhanced alignment achieved. Moreover, this lays the groundwork for addressing challenges such as the inconsistency of ORKG properties, an issue highlighted in prior studies. By assigning unique URIs and using standardized terminology, this work increases the consistency of the properties, fulfilling a crucial aspect of Linked Data and FAIR principles - core commitments of ORKG. This, in turn, significantly enhances the applicability of ORKG content for subsequent tasks such as comparisons of research publications. Finally, the study concludes with recommendations for future improvements in the overall property extraction process.
- North America > United States > Nevada (0.04)
- Europe (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.94)
- Workflow (0.93)
The Superalignment of Superhuman Intelligence with Large Language Models
Huang, Minlie, Wang, Yingkang, Cui, Shiyao, Ke, Pei, Tang, Jie
We have witnessed superhuman intelligence thanks to the fast development of large language models and multimodal language models. As the application of such superhuman models becomes more and more popular, a critical question arises here: how can we ensure superhuman models are still safe, reliable and aligned well to human values? In this position paper, we discuss the concept of superalignment from the learning perspective to answer this question by outlining the learning paradigm shift from large-scale pretraining, supervised fine-tuning, to alignment training. We define superalignment as designing effective and efficient alignment algorithms to learn from noisy-labeled data (point-wise samples or pair-wise preference data) in a scalable way when the task becomes very complex for human experts to annotate and the model is stronger than human experts. We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation. We then present a conceptual framework for superalignment, which consists of three modules: an attacker which generates adversary queries trying to expose the weaknesses of a learner model; a learner which will refine itself by learning from scalable feedbacks generated by a critic model along with minimal human experts; and a critic which generates critics or explanations for a given query-response pair, with a target of improving the learner by criticizing. We discuss some important research problems in each component of this framework and highlight some interesting research ideas that are closely related to our proposed framework, for instance, self-alignment, self-play, self-refinement, and more. Last, we highlight some future research directions for superalignment, including identification of new emergent risks and multi-dimensional alignment.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (2 more...)